Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure that the data types are supported in hashjoin before genera… #2702

Merged
merged 5 commits into from
Jun 9, 2022

Conversation

HuSen8891
Copy link
Contributor

Which issue does this PR close?

Closes #2145

Rationale for this change

Before generating the hash join logical plan, make sure the data types in equal conditions are supported. Otherwise, try
cross join instead.

What changes are included in this PR?

  1. add can_hash func in datafusion/expr/src/utils.rs
  2. use can_hash in datafusion/expr/src/logical_plan/builder.rs and datafusion/sql/src/planner.rs to check if the data type is supported in hashjoin or not

…ting hashjoin logical plan.

If data types are not supported in hashjoin, try cross join with filters.
@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner labels Jun 6, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2022

Codecov Report

Merging #2702 (b9b9434) into master (9306534) will decrease coverage by 0.01%.
The diff coverage is 79.06%.

@@            Coverage Diff             @@
##           master    #2702      +/-   ##
==========================================
- Coverage   84.66%   84.64%   -0.02%     
==========================================
  Files         270      270              
  Lines       46919    47037     +118     
==========================================
+ Hits        39724    39815      +91     
- Misses       7195     7222      +27     
Impacted Files Coverage Δ
datafusion/expr/src/logical_plan/builder.rs 89.35% <55.55%> (-2.60%) ⬇️
datafusion/expr/src/utils.rs 90.39% <68.18%> (-1.48%) ⬇️
datafusion/sql/src/planner.rs 81.33% <69.23%> (-0.19%) ⬇️
datafusion/core/tests/sql/joins.rs 99.07% <100.00%> (+0.13%) ⬆️
datafusion/row/src/lib.rs 100.00% <0.00%> (ø)
datafusion/row/src/layout.rs 98.27% <0.00%> (ø)
datafusion/row/src/reader.rs 88.27% <0.00%> (ø)
datafusion/row/src/writer.rs 89.86% <0.00%> (ø)
datafusion/row/src/accessor.rs 56.41% <0.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9306534...b9b9434. Read the comment docs.

Comment on lines 630 to 636
Some(filter_expr) => {
filters = Some(Expr::BinaryExpr {
left: Box::new(expr),
op: Operator::And,
right: Box::new(filter_expr),
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Some(filter_expr) => {
filters = Some(Expr::BinaryExpr {
left: Box::new(expr),
op: Operator::And,
right: Box::new(filter_expr),
});
}
Some(filter_expr) => filters = Some(and(expr, filter_expr)),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'll fix this.

Comment on lines 623 to 627
let expr = Expr::BinaryExpr {
left: Box::new(Expr::Column(l.clone())),
op: Operator::Eq,
right: Box::new(Expr::Column(r.clone())),
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let expr = Expr::BinaryExpr {
left: Box::new(Expr::Column(l.clone())),
op: Operator::Eq,
right: Box::new(Expr::Column(r.clone())),
};
let expr = binary_expr(
Expr::Column(l.clone()),
Operator::Eq,
Expr::Column(r.clone()),
);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'll fix this.

@@ -19,6 +19,7 @@

use crate::expr_rewriter::{normalize_col, normalize_cols, rewrite_sort_cols_by_aggs};
use crate::utils::{columnize_expr, exprlist_to_fields, from_plan};
use crate::Operator;
Copy link
Member

@andygrove andygrove Jun 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some suggestions below that require these additional imports

Suggested change
use crate::Operator;
use crate::{and, binary_expr, Operator};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'll fix this.

/// If more data types are supported in hash join, add those data types here
/// to generate join logical plan.
pub fn can_hash(data_type: &DataType) -> bool {
match data_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decimal should be supported here

Copy link
Contributor Author

@HuSen8891 HuSen8891 Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data types here come from function equal_rows, decimal is not supported in equal_rows so that hash join currently does not support joining on columns of decimal data type. That's why decimal is not here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may help to add a comment here (or in equal_rows) mentioning they need to remain in sync

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may help to add a comment here (or in equal_rows) mentioning they need to remain in sync

I'll add this comment.

@HuSen8891 HuSen8891 requested a review from andygrove June 8, 2022 00:18
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AssHero

I believe this PR does what the description says -- and it adds nice test coverage so I am approving it.

However, I am concerned that trying to run a query using a CrossJoin is likely to be catastrophically bad so this approach will not work for anything except very small inputs.

Comment on lines +1230 to +1232
// join on hash unsupported data type (Date32), use cross join instead hash join
let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
let msg = format!("Creating logical plan for '{}'", sql);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.

From the issue description #2145 (comment) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow cast kernels are quite efficient for things like Date32 -> Int32 (no copies) as the representations are the same

@pjmore what do you think?

Copy link
Contributor Author

@HuSen8891 HuSen8891 Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.

From the issue description #2145 (comment) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow cast kernels are quite efficient for things like Date32 -> Int32 (no copies) as the representations are the same

@pjmore what do you think?

I agree. supporting more data types in hash join is the better way to solve this issue, and i'm already working on it.
And this pr only wants to make hash unsupported join running in cross join instead of error/panic, we can support more data types in hash join continuously.

/// If more data types are supported in hash join, add those data types here
/// to generate join logical plan.
pub fn can_hash(data_type: &DataType) -> bool {
match data_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may help to add a comment here (or in equal_rows) mentioning they need to remain in sync

@alamb
Copy link
Contributor

alamb commented Jun 9, 2022

Thank you again @AssHero

@alamb alamb merged commit 67d91a7 into apache:master Jun 9, 2022
@HuSen8891 HuSen8891 deleted the stat branch June 9, 2022 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Join on path partitioned columns fails with error
5 participants